SAMZA-2804: Concurrency issues identified in run-class.sh on samza-yarn#1716
Merged
shanthoosh merged 3 commits intoapache:masterfrom May 2, 2025
Merged
SAMZA-2804: Concurrency issues identified in run-class.sh on samza-yarn#1716shanthoosh merged 3 commits intoapache:masterfrom
shanthoosh merged 3 commits intoapache:masterfrom
Conversation
0f7f6a8 to
5e611eb
Compare
## Race condition in pathing jar manifest creation A race condition exists when setting up the classpath during container launch. During container launch using samza-yarn, run-class.sh creates a pathing jar file (which holds the classpath for the container launch). However, during the creation of this pathing jar, temporary files, as well as the pathing jar itself is not placed in a location unique to the container. This results in multiple containers writing to the same pathing jar location and temporary file location, which results in a race condition. This race condition may show up in several ways, such as when Yarn removes jars from a finished container (other containers will point to a classpath which no longer exists) or when multiple run-class.sh scripts attempt to write the manifest.txt or pathing jar at the same time. Note that host affinity being enabled will make this problem worse. The pathing.jar is written to the usercache, so when the container which created the pathing.jar is finished and removed, any new container which launches on that host will point to jar files which do not exist anymore. When host affinity is enabled, it will not move to a new host and just keep failing. ## Container logging directory fallback is not unique for each container The fallback log directory is the same among all containers running on the same host. It should be unique per-container. ## Container tmp dir is not unique per-container The JAVA_TMP_DIR directory is the same for all containers. We should make sure that it's safe to use the same directory for all containers.
5e611eb to
942e0d1
Compare
Contributor
Author
shanthoosh
reviewed
May 2, 2025
|
|
||
| if [ -z "$SAMZA_LOG_DIR" ]; then | ||
| SAMZA_LOG_DIR="$base_dir" | ||
| # When on samza-yarn, SAMZA_LOG_DIR will point to the symlink located at: |
Contributor
There was a problem hiding this comment.
Thanks for adding the comments. Please remove when on samza-yarn prefix, since this script is applicable only for yarn deployments.
Contributor
Author
There was a problem hiding this comment.
Updated to simpify.
| PATHING_MANIFEST_FILE=$CLASSPATH_WORKSPACE_DIR/manifest.txt | ||
|
|
||
| # jar file to include on the classpath for running the main class | ||
| PATHING_JAR_FILE=$CLASSPATH_WORKSPACE_DIR/pathing.jar |
Contributor
There was a problem hiding this comment.
Would be great if we can log the pathing_jar_file path and the manifest_file path?
|
|
||
| # file containing the classpath string; used to avoid passing long classpaths directly to the jar command | ||
| PATHING_MANIFEST_FILE=$CLASSPATH_WORKSPACE_DIR/manifest.txt | ||
|
|
Contributor
There was a problem hiding this comment.
Would be great to log the location of manifest-file and the pathing-jar-file path.
shanthoosh
approved these changes
May 2, 2025
Contributor
shanthoosh
left a comment
There was a problem hiding this comment.
LGTM. Thanks for making the changes
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR contains fixes to several concurrency related issues when
run-class.shandrun-framework-class.shis used on samza-yarn.BUG=SAMZA-2804
Details
Race condition in pathing jar manifest creation
A race condition exists when setting up the classpath during container launch.
During container launch using samza-yarn, run-class.sh creates a pathing jar file (which holds the classpath for the container launch). However, during the creation of this pathing jar, temporary files, as well as the pathing jar itself is not placed in a location unique to the container. This results in multiple containers writing to the same pathing jar location and temporary file location, which results in a race condition.
This race condition may show up in several ways, such as when Yarn removes jars from a finished container (other containers will point to a classpath which no longer exists) or when multiple run-class.sh scripts attempt to write the manifest.txt or pathing jar at the same time.
Note that host affinity being enabled will make this problem worse. The pathing.jar is written to the usercache, so when the container which created the pathing.jar is finished and removed, any new container which launches on that host will point to jar files which do not exist anymore. When host affinity is enabled, it will not move to a new host and just keep failing.
This has been modified to use a path unique to the container.
Container logging directory fallback is not unique for each container
The fallback log directory is the same among all containers running on the same host.
This has been modified to use a path unique to the container.
Container tmp dir is not unique per-container
The JAVA_TMP_DIR directory is the same for all containers.
This has been modified to use a path unique to the container.
Testing done
Validated that the
manifest.txtandpathing.jarare now located in a per-container directory:Compared to before the patch, which shows them being written to an application-level package cache directory:
See also
An older attempt to solve this issue can be found at #1498, however, that change overlooks that
$base_dirleads to__package, which is a symlink to the application level directory (note the use of readlink in the test example above).